Skip to content

fix(cluster_family): Cancel slot migration from incoming node on OOM #5000

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 3 commits into
base: main
Choose a base branch
from

Conversation

mkaruza
Copy link
Contributor

@mkaruza mkaruza commented Apr 25, 2025

If applying command on incoming node will result in OOM (we overflow
max_memory_limit) we are closing migration and switch state to FATAL.

@mkaruza mkaruza requested a review from BorysTheDev April 25, 2025 11:16
@mkaruza mkaruza marked this pull request as draft April 25, 2025 11:16
@mkaruza mkaruza force-pushed the mkaruza/github#4977 branch from 9c88c11 to 6a5d621 Compare May 6, 2025 12:34
@mkaruza mkaruza requested a review from BorysTheDev May 6, 2025 12:35
@mkaruza mkaruza marked this pull request as ready for review May 6, 2025 12:35
@mkaruza mkaruza changed the title fix(cluster_family): Cancel slot migration in case of OOM errors on t… fix(cluster_family): Cancel slot migration from incoming node on OOM May 6, 2025
Comment on lines 985 to 987
if (migration->GetState() == MigrationState::C_FATAL) {
migration->Stop();
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

looks like incorrect place for this logic

Copy link
Contributor Author

@mkaruza mkaruza May 6, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If FLOW fails with C_FATAL we'll call it, where would you put migration->Stop to handle stopping of migration?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe we can move it to reportError or SetState, where we set the fatal status

@@ -70,6 +76,20 @@ class ClusterShardMigration {
break;
}

auto oom_check = [&]() -> bool {
auto used_mem = used_mem_current.load(memory_order_relaxed);
if ((used_mem + tx_data->command.cmd_len) > max_memory_limit) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is used_mem RSS? maybe we need max_memory_limit - 100MB for example, or how we can guarantee 100% success of this condition?

Copy link
Contributor Author

@mkaruza mkaruza May 6, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, it is checks whatever is set with maxmemory. I'll modify so it uses 90% of max memory as upper bound. That should be safe i think.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's discuss it with @adiholden or @romange

Comment on lines +57 to +60
void ReportFatalError(dfly::GenericError err) ABSL_LOCKS_EXCLUDED(state_mu_, error_mu_) {
errors_count_.fetch_add(1, std::memory_order_relaxed);
util::fb2::LockGuard lk_state(state_mu_);
util::fb2::LockGuard lk_error(error_mu_);
state_ = MigrationState::C_FATAL;
last_error_ = std::move(err);
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't like this code. Let's think how to make it better. Maybe we can use state_mu_ for error too or merge this method with reportError

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agree, but i changed ReportError to check GetState() != C_FATAL and i wanted to update state & set error message while having both locks.

@mkaruza mkaruza force-pushed the mkaruza/github#4977 branch from e9049f0 to 16a27f6 Compare May 7, 2025 11:32
@mkaruza mkaruza requested a review from BorysTheDev May 7, 2025 11:33
@BorysTheDev BorysTheDev requested review from adiholden and kostasrim May 7, 2025 11:50
mkaruza added 2 commits May 7, 2025 20:40
If applying command on incoming node will result in OOM (we overflow
max_memory_limit) we are closing migration and switch state to FATAL.

Signed-off-by: mkaruza <[email protected]>
@mkaruza mkaruza force-pushed the mkaruza/github#4977 branch from 16a27f6 to f080a52 Compare May 7, 2025 18:54
@mkaruza mkaruza force-pushed the mkaruza/github#4977 branch from 7e7ebf4 to 40851f9 Compare May 7, 2025 19:41
@mkaruza mkaruza requested a review from BorysTheDev May 8, 2025 13:30
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants